Introduction
The housing market in Melbourne is a complex and dynamic environment where various factors interplay to determine the value of properties. In this project, we delve into an exploratory data analysis (EDA) of the Melbourne housing market to uncover key insights into how property prices are influenced by both intrinsic characteristics, such as the number of rooms, and external factors, including location and proximity to the city center.
The goal of this analysis is to identify patterns and relationships
within the dataset that drive property valuations. By the end of the
project, we aim to develop a comprehensive understanding of the factors
that most significantly affect house prices. This will culminate in a
regression analysis, where we will quantify the impact of these
variables and create a predictive model that can be used to estimate
property values based on the identified features.
Importing Data and Dependencies
library(ggplot2)
library(dplyr)
library(GGally)
library(lubridate)
library(corrplot)
library(scales)
library(broom)
library(car)
#Importing Data
data <- read.csv("Data/MELBOURNE_HOUSE_PRICES_LESS.csv", stringsAsFactors = FALSE)
head(data)
## Suburb Address Rooms Type Price Method SellerG Date
## 1 Abbotsford 49 Lithgow St 3 h 1490000 S Jellis 1/04/2017
## 2 Abbotsford 59A Turner St 3 h 1220000 S Marshall 1/04/2017
## 3 Abbotsford 119B Yarra St 3 h 1420000 S Nelson 1/04/2017
## 4 Aberfeldie 68 Vida St 3 h 1515000 S Barry 1/04/2017
## 5 Airport West 92 Clydesdale Rd 2 h 670000 S Nelson 1/04/2017
## 6 Airport West 4/32 Earl St 2 t 530000 S Jellis 1/04/2017
## Postcode Regionname Propertycount Distance
## 1 3067 Northern Metropolitan 4019 3.0
## 2 3067 Northern Metropolitan 4019 3.0
## 3 3067 Northern Metropolitan 4019 3.0
## 4 3040 Western Metropolitan 1543 7.5
## 5 3042 Western Metropolitan 3464 10.4
## 6 3042 Western Metropolitan 3464 10.4
## CouncilArea
## 1 Yarra City Council
## 2 Yarra City Council
## 3 Yarra City Council
## 4 Moonee Valley City Council
## 5 Moonee Valley City Council
## 6 Moonee Valley City Council
1- Overview of the Dataset
glimpse(data)
## Rows: 63,023
## Columns: 13
## $ Suburb <chr> "Abbotsford", "Abbotsford", "Abbotsford", "Aberfeldie", …
## $ Address <chr> "49 Lithgow St", "59A Turner St", "119B Yarra St", "68 V…
## $ Rooms <int> 3, 3, 3, 3, 2, 2, 2, 3, 6, 3, 3, 4, 2, 4, 2, 4, 3, 2, 2,…
## $ Type <chr> "h", "h", "h", "h", "h", "t", "u", "h", "h", "h", "u", "…
## $ Price <int> 1490000, 1220000, 1420000, 1515000, 670000, 530000, 5400…
## $ Method <chr> "S", "S", "S", "S", "S", "S", "S", "SP", "PI", "S", "S",…
## $ SellerG <chr> "Jellis", "Marshall", "Nelson", "Barry", "Nelson", "Jell…
## $ Date <chr> "1/04/2017", "1/04/2017", "1/04/2017", "1/04/2017", "1/0…
## $ Postcode <int> 3067, 3067, 3067, 3040, 3042, 3042, 3042, 3042, 3021, 32…
## $ Regionname <chr> "Northern Metropolitan", "Northern Metropolitan", "North…
## $ Propertycount <int> 4019, 4019, 4019, 1543, 3464, 3464, 3464, 3464, 1899, 32…
## $ Distance <dbl> 3.0, 3.0, 3.0, 7.5, 10.4, 10.4, 10.4, 10.4, 14.0, 3.0, 1…
## $ CouncilArea <chr> "Yarra City Council", "Yarra City Council", "Yarra City …
summary(data)
## Suburb Address Rooms Type
## Length:63023 Length:63023 Min. : 1.000 Length:63023
## Class :character Class :character 1st Qu.: 3.000 Class :character
## Mode :character Mode :character Median : 3.000 Mode :character
## Mean : 3.111
## 3rd Qu.: 4.000
## Max. :31.000
##
## Price Method SellerG Date
## Min. : 85000 Length:63023 Length:63023 Length:63023
## 1st Qu.: 620000 Class :character Class :character Class :character
## Median : 830000 Mode :character Mode :character Mode :character
## Mean : 997898
## 3rd Qu.: 1220000
## Max. :11200000
## NA's :14590
## Postcode Regionname Propertycount Distance
## Min. :3000 Length:63023 Min. : 39 Min. : 0.00
## 1st Qu.:3056 Class :character 1st Qu.: 4380 1st Qu.: 7.00
## Median :3107 Mode :character Median : 6795 Median :11.40
## Mean :3126 Mean : 7618 Mean :12.68
## 3rd Qu.:3163 3rd Qu.:10412 3rd Qu.:16.70
## Max. :3980 Max. :21650 Max. :64.10
##
## CouncilArea
## Length:63023
## Class :character
## Mode :character
##
##
##
##
In this initial step, we can already observe some interesting details. First, we now have a clear picture of all the variables in the dataset. Second, the maximum number of rooms is unusually high: 31 rooms is not impossible, but it is far from the median of 3 and warrants further investigation.
We also noticed that the Price variable contains 14,590 missing
values (NAs). Given that Price is the response variable for
this analysis, these missing values must be addressed to ensure
the accuracy and reliability of our results. Additionally, several
variables are stored as character strings and need to be converted
to more suitable types (factors and dates).
In the subsequent steps, we will delve deeper into these observations, starting with a thorough exploration of the outliers and a comprehensive strategy for handling missing data.
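Before choosing a strategy for missing data, it can help to tabulate missing values for every column rather than reading them off `summary()`. The following is a minimal sketch in base R on a small made-up data frame (`toy` and its values are hypothetical, not taken from the dataset):

```r
# Toy data frame standing in for the housing data (values are made up).
toy <- data.frame(
  Rooms    = c(3L, 2L, 4L, 3L),
  Price    = c(1490000, NA, 1220000, NA),
  Distance = c(3.0, 7.5, NA, 10.4)
)

# Count and percentage of missing values in each column.
na_count <- colSums(is.na(toy))
na_pct   <- round(100 * na_count / nrow(toy), 2)

na_report <- data.frame(
  column   = names(na_count),
  na_count = na_count,
  na_pct   = na_pct,
  row.names = NULL
)
print(na_report)
```

The same two lines (`colSums(is.na(...))` and the percentage) could be applied to the full data frame to confirm which columns contain missing values.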
2- Data Pre-processing
2.1 - Data Types
data <- data %>%
mutate(across(c(Suburb, Type, Method, Regionname, CouncilArea),as.factor),
Date = as.Date(Date, format = "%d/%m/%Y"))
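A quick sanity check after converting dates can catch entries that do not match the expected day/month/year format, because `as.Date()` returns NA for them instead of raising an error. The sketch below uses a small made-up vector (`raw_dates` is hypothetical, and its last entry is deliberately malformed):

```r
# Hypothetical raw date strings; the third one is in the wrong format on purpose.
raw_dates <- c("1/04/2017", "13/10/2018", "2017-04-01")

# Parse using the same day/month/year format as the conversion above.
parsed <- as.Date(raw_dates, format = "%d/%m/%Y")

# Entries that failed to parse become NA and can be listed for inspection.
failed <- raw_dates[is.na(parsed)]
print(parsed)
print(failed)
```

If `failed` is empty on the real column, every date string matched the expected format.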
2.2 - Missing Values
#Dealing with Missing Values
na_sum <- data %>%
filter(is.na(Price)) %>%
nrow()
freq_na <- round(na_sum/nrow(data) * 100, digits = 2)
#Plotting NA %
na_data <- data.frame(
category = c("NA", "Not NA"),
count = c(na_sum, nrow(data) - na_sum)
)
ggplot(na_data, aes(x = "", y = count, fill = category)) +
geom_bar(width = 1, stat = "identity") +
coord_polar("y", start = 0) +
labs(title = "Frequency of NA Values in the Price Column") +
theme_void() +
theme(legend.title = element_blank()) +
scale_fill_manual(values = c("NA" = "lightblue", "Not NA" = "skyblue")) +
geom_text(aes(label = paste0(round(count / sum(count) * 100, 2), "%")),
position = position_stack(vjust = 0.5))
#Remove NA'S
data_cleaned <- data %>%
filter(rowSums(is.na(.)) == 0)
glimpse(data_cleaned)
## Rows: 48,433
## Columns: 13
## $ Suburb <fct> Abbotsford, Abbotsford, Abbotsford, Aberfeldie, Airport …
## $ Address <chr> "49 Lithgow St", "59A Turner St", "119B Yarra St", "68 V…
## $ Rooms <int> 3, 3, 3, 3, 2, 2, 2, 3, 3, 3, 4, 2, 4, 2, 3, 2, 2, 3, 2,…
## $ Type <fct> h, h, h, h, h, t, u, h, h, u, h, h, h, h, h, u, h, h, u,…
## $ Price <int> 1490000, 1220000, 1420000, 1515000, 670000, 530000, 5400…
## $ Method <fct> S, S, S, S, S, S, S, SP, S, S, S, S, S, SP, S, S, S, S, …
## $ SellerG <chr> "Jellis", "Marshall", "Nelson", "Barry", "Nelson", "Jell…
## $ Date <date> 2017-04-01, 2017-04-01, 2017-04-01, 2017-04-01, 2017-04…
## $ Postcode <int> 3067, 3067, 3067, 3040, 3042, 3042, 3042, 3042, 3206, 30…
## $ Regionname <fct> Northern Metropolitan, Northern Metropolitan, Northern M…
## $ Propertycount <int> 4019, 4019, 4019, 1543, 3464, 3464, 3464, 3464, 3280, 21…
## $ Distance <dbl> 3.0, 3.0, 3.0, 7.5, 10.4, 10.4, 10.4, 10.4, 3.0, 10.5, 1…
## $ CouncilArea <fct> Yarra City Council, Yarra City Council, Yarra City Counc…
summary(data_cleaned)
## Suburb Address Rooms Type
## Reservoir : 1067 Length:48433 Min. : 1.000 h:34161
## Bentleigh East: 696 Class :character 1st Qu.: 2.000 t: 4980
## Richmond : 642 Mode :character Median : 3.000 u: 9292
## Craigieburn : 598 Mean : 3.072
## Preston : 593 3rd Qu.: 4.000
## Mount Waverley: 556 Max. :31.000
## (Other) :44281
## Price Method SellerG Date
## Min. : 85000 S :30624 Length:48433 Min. :2016-01-28
## 1st Qu.: 620000 SP : 6480 Class :character 1st Qu.:2016-12-03
## Median : 830000 PI : 5940 Mode :character Median :2017-08-26
## Mean : 997898 VB : 5024 Mean :2017-07-31
## 3rd Qu.: 1220000 SA : 365 3rd Qu.:2018-03-03
## Max. :11200000 PN : 0 Max. :2018-10-13
## (Other): 0
## Postcode Regionname Propertycount
## Min. :3000 Northern Metropolitan :13598 Min. : 39
## 1st Qu.:3051 Southern Metropolitan :12549 1st Qu.: 4280
## Median :3103 Western Metropolitan : 9680 Median : 6567
## Mean :3123 Eastern Metropolitan : 7585 Mean : 7566
## 3rd Qu.:3163 South-Eastern Metropolitan: 4010 3rd Qu.:10412
## Max. :3980 Northern Victoria : 455 Max. :21650
## (Other) : 556
## Distance CouncilArea
## Min. : 0.0 Darebin City Council : 3462
## 1st Qu.: 7.0 Boroondara City Council: 3455
## Median :11.7 Banyule City Council : 2902
## Mean :12.7 Brimbank City Council : 2720
## 3rd Qu.:16.7 Moreland City Council : 2519
## Max. :55.8 Bayside City Council : 2495
## (Other) :30880
2.3 - Duplicate Values
dup <- data_cleaned %>%
filter(duplicated(.))
dup
## Suburb Address Rooms Type Price Method SellerG Date
## 1 Fitzroy North 5/16 Taplin St 2 h 1010000 SP Woodards 2018-05-05
## Postcode Regionname Propertycount Distance CouncilArea
## 1 3068 Northern Metropolitan 6244 3.6 Moreland City Council
data_cleaned <- data_cleaned %>%
distinct()
Upon reviewing the dataset, we found that 23.15% of the values in the Price column are missing. While imputing missing values is a common technique to preserve sample size, the decision must be weighed against the risks it introduces, particularly when a large proportion of the data is missing.
Price is also the variable we want to model, so imputing nearly a quarter of its values might cause the model to “learn” from artificially introduced data rather than from the true underlying patterns. This can produce a model that performs well on training data but fails to generalize to new, unseen data. Imputed values, especially those based on simple methods such as the mean or median, can introduce patterns that do not exist in the real data and give a false sense of accuracy. In addition, after removing the rows with missing values we still have 48,433 records, which is more than sufficient for a robust analysis. Finally, we identified and removed one duplicated row, leaving 48,432 records.
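The concern about simple imputation can be illustrated on simulated data. The sketch below is only an illustration under stated assumptions: the prices are drawn from a log-normal distribution with made-up parameters, and about 23% of them are blanked out at random. It compares the standard deviation after dropping the missing values with the standard deviation after filling them with the median.

```r
set.seed(42)

# Simulated right-skewed prices (parameters are arbitrary, not estimated from the data).
price <- rlnorm(1000, meanlog = 13.6, sdlog = 0.5)

# Blank out 230 of the 1000 values (23%) at random.
price_missing <- price
price_missing[sample(length(price), 230)] <- NA

# Option 1: drop the missing values.
dropped <- price_missing[!is.na(price_missing)]

# Option 2: fill the missing values with the observed median.
obs_median <- median(price_missing, na.rm = TRUE)
imputed <- ifelse(is.na(price_missing), obs_median, price_missing)

sd_dropped <- sd(dropped)
sd_imputed <- sd(imputed)
print(c(sd_dropped = sd_dropped, sd_imputed = sd_imputed))
```

Median imputation piles 230 identical values into the middle of the distribution, so the imputed series has a smaller spread than the observed values alone. This is one concrete way an artificial pattern can enter the data.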
2.4 - Dealing with Outliers
room_dist <- ggplot(data_cleaned, aes(x = as.factor(Rooms))) +
geom_bar(fill = "skyblue", color = "white") +
labs(x = "Rooms", y = "Count", title = "Frequency of Number of Rooms") +
theme_classic()
room_dist
rooms_freq <- as.data.frame(table(data_cleaned$Rooms))
rooms_freq
## Var1 Freq
## 1 1 1670
## 2 2 10673
## 3 3 21812
## 4 4 11576
## 5 5 2350
## 6 6 283
## 7 7 36
## 8 8 19
## 9 9 2
## 10 10 6
## 11 11 1
## 12 12 2
## 13 16 1
## 14 31 1
data_cleaned<- data_cleaned %>% filter(Rooms <= 8 )
hist(data_cleaned$Price, main = "Frequency of Price", xlab = "Price", col = "skyblue", border = "white")
hist(data_cleaned$Propertycount, main = "Frequency of Property Count", xlab = "Property Count", col = "skyblue",
border = "white")
hist(data_cleaned$Distance, main = "Frequency of Distance", xlab = "Distance", col = "skyblue", border = "white")
ggplot(data_cleaned, aes(x=Distance))+
geom_boxplot()
data_cleaned$Log_Price <- log(data_cleaned$Price)
hist(data_cleaned$Price, main="Frequency of Original Price", xlab="Price", col = "skyblue", border = "white")
hist(data_cleaned$Log_Price, main="Frequency of Log-Transformed Price", xlab="Log(Price)", col = "skyblue", border = "white")
In this analysis, our goal is to conduct a comprehensive market study, which means we will aim to retain most of the outliers. Many of the continuous variables in our dataset contain some proportion of outliers.
Let’s start by analyzing the Price variable. This variable exhibits a significant right skew. The outliers here do not appear to be data-entry errors. If our focus were on a more conservative market analysis, we might have opted to remove these values. However, this is not our objective here. Instead, we have chosen to retain these outliers and apply a log transformation to Price, which reduces the skew and brings its distribution closer to normal.
When it comes to the Rooms variable, this was the only variable where we decided to remove some values. Specifically, we excluded properties with more than 8 rooms. Each of these higher categories contains very few observations (13 in total across 9, 10, 11, 12, 16, and 31 rooms), so their estimates would be unreliable and could unduly influence the results.
For the rest of the variables, we have decided to keep the outliers
intact, consistent with our goal of conducting a comprehensive market
analysis. When we proceed to the regression analysis, we will evaluate
the impact of this decision on our model.
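The effect of the log transformation on skew can be quantified as well as plotted. The following sketch uses simulated log-normal prices (arbitrary parameters, not the real data) and a small hand-written moment-based `skewness()` helper, which is defined here for illustration and is not part of the packages loaded above.

```r
set.seed(1)

# Simulated right-skewed prices (parameters are arbitrary).
price <- rlnorm(5000, meanlog = 13.6, sdlog = 0.5)

# Moment-based sample skewness: third central moment over variance^(3/2).
skewness <- function(x) {
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

skew_raw <- skewness(price)       # clearly positive for right-skewed data
skew_log <- skewness(log(price))  # close to zero after the transformation
print(c(skew_raw = skew_raw, skew_log = skew_log))
```

Applying the same helper to `Price` and `Log_Price` in the cleaned data would give a numeric counterpart to the two histograms.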
3- Exploratory Data Analysis (EDA)
Now that
we have cleaned our data, we can start analyzing our variables to
uncover meaningful insights. Our first step will involve exploring the
distributions and relationships between key variables to understand
their impact on property prices. We will divide this section into three
parts:
3.1 Univariate Analysis: We will begin by examining the distribution of individual variables. This includes looking at the range, central tendency (mean, median), and variability (standard deviation) for numerical variables such as Rooms, Distance, and Propertycount. For categorical variables like Suburb, Type, Method, and CouncilArea, we will analyze the frequency distribution to identify the most common categories.
3.2 Bivariate Analysis: Next, we will explore the relationships between pairs of variables. This will include scatter plots and correlation matrices to identify potential linear relationships between Price and other numerical variables, as well as box plots to examine how Price varies across different categories, such as Type or Regionname.
3.3 Multivariate Analysis: Finally, we will perform a multivariate analysis to understand the combined effect of multiple variables on Price. Techniques such as regression analysis will be used to quantify the impact of each variable while controlling for others.
3.1 Univariate Analysis
Numerical Variables
ggplot(data_cleaned, aes(Log_Price)) +
geom_histogram(fill = "skyblue", color = "white")+
labs(x = "Log(Price)", y = "Frequency", title = "Frequency of Log_Price" ) +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data_cleaned, aes(Rooms)) +
geom_bar(fill = "skyblue", color = "white")+
labs(x = "Rooms", y = "Frequency", title = "Frequency of Rooms" ) +
theme_minimal()
ggplot(data_cleaned, aes(Propertycount)) +
geom_histogram(fill = "skyblue", color = "white")+
labs(x = "Property Count", y = "Frequency", title = "Frequency of Propertycount" ) +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
ggplot(data_cleaned, aes(Distance)) +
geom_histogram(fill = "skyblue", color = "white")+
labs(x = "Distance", y = "Frequency", title = "Frequency of Distance" ) +
theme_minimal()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Categorical Variables
summary(data_cleaned)
## Suburb Address Rooms Type
## Reservoir : 1067 Length:48419 Min. :1.000 h:34147
## Bentleigh East: 696 Class :character 1st Qu.:2.000 t: 4980
## Richmond : 642 Mode :character Median :3.000 u: 9292
## Craigieburn : 598 Mean :3.069
## Preston : 593 3rd Qu.:4.000
## Mount Waverley: 556 Max. :8.000
## (Other) :44267
## Price Method SellerG Date
## Min. : 85000 S :30616 Length:48419 Min. :2016-01-28
## 1st Qu.: 620000 SP : 6477 Class :character 1st Qu.:2016-12-03
## Median : 830000 PI : 5938 Mode :character Median :2017-08-26
## Mean : 997456 VB : 5023 Mean :2017-07-31
## 3rd Qu.: 1220000 SA : 365 3rd Qu.:2018-03-03
## Max. :11200000 PN : 0 Max. :2018-10-13
## (Other): 0
## Postcode Regionname Propertycount
## Min. :3000 Northern Metropolitan :13593 Min. : 39
## 1st Qu.:3051 Southern Metropolitan :12545 1st Qu.: 4280
## Median :3103 Western Metropolitan : 9680 Median : 6567
## Mean :3123 Eastern Metropolitan : 7583 Mean : 7567
## 3rd Qu.:3163 South-Eastern Metropolitan: 4008 3rd Qu.:10412
## Max. :3980 Northern Victoria : 455 Max. :21650
## (Other) : 555
## Distance CouncilArea Log_Price
## Min. : 0.0 Darebin City Council : 3462 Min. :11.35
## 1st Qu.: 7.0 Boroondara City Council: 3454 1st Qu.:13.34
## Median :11.7 Banyule City Council : 2900 Median :13.63
## Mean :12.7 Brimbank City Council : 2720 Mean :13.68
## 3rd Qu.:16.7 Moreland City Council : 2517 3rd Qu.:14.01
## Max. :55.8 Bayside City Council : 2495 Max. :16.23
## (Other) :30871
n_suburb <- length(levels(data_cleaned$Suburb))
paste("There are", n_suburb, "levels in Suburb")
## [1] "There are 380 levels in Suburb"
There are too many levels in Suburb to inspect individually. Let's look at the 20 suburbs with the most observations.
suburb_frequencies<- data_cleaned %>%
group_by(Suburb) %>%
summarise(count = n()) %>%
arrange(desc(count))
top20_suburb <- suburb_frequencies %>%
slice_head(n = 20)
ggplot(suburb_frequencies, aes(x = count)) +
geom_histogram(bins = 30, fill = "skyblue", color = "white") +
labs(title = "Histogram of Suburb Frequencies", x = "Number of Observations", y = "Frequency") +
theme_minimal()
top20_suburb
## # A tibble: 20 × 2
## Suburb count
## <fct> <int>
## 1 Reservoir 1067
## 2 Bentleigh East 696
## 3 Richmond 642
## 4 Craigieburn 598
## 5 Preston 593
## 6 Mount Waverley 556
## 7 Brunswick 540
## 8 Northcote 496
## 9 Cheltenham 493
## 10 Glen Waverley 486
## 11 Essendon 485
## 12 Glenroy 482
## 13 Coburg 464
## 14 Mill Park 454
## 15 South Yarra 436
## 16 Glen Iris 434
## 17 Pascoe Vale 433
## 18 Kew 430
## 19 Bundoora 429
## 20 Hawthorn 428
representation_t20s <- round(((sum(top20_suburb$count)/nrow(data_cleaned)) * 100), digits = 2)
paste("Top 20 suburbs represent", representation_t20s, "% of all observations")
## [1] "Top 20 suburbs represent 21.98 % of all observations"
The majority of suburbs have fewer than 200 observations, indicating a wide distribution of data across different areas. The top 20 suburbs with the most observations account for only 21.98% of the total data, highlighting the fragmented nature of the dataset.
This suggests that while these top 20 suburbs provide a significant amount of data, much of the market analysis will also need to consider the long tail of less represented suburbs. It could be interesting to study the price fluctuations between these 20 most common suburbs and the less represented ones, to see if there are significant differences in price trends that could inform broader market insights.
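The comparison between well-represented suburbs and the long tail could be set up by flagging the most frequent suburbs and summarising price per group. The sketch below shows the idea in base R on a made-up data frame (`toy`, its suburb names and prices are all hypothetical), keeping the top 2 suburbs instead of the top 20 for brevity.

```r
# Toy data: suburb A has 4 sales, B has 3, C has 2, D and E have 1 each.
toy <- data.frame(
  Suburb = c("A", "A", "A", "A", "B", "B", "B", "C", "C", "D", "E"),
  Price  = c(900, 950, 1000, 1100, 700, 720, 760, 1500, 1400, 500, 650) * 1000
)

# Rank suburbs by number of observations and keep the most frequent ones.
counts <- sort(table(toy$Suburb), decreasing = TRUE)
top_n  <- names(counts)[1:2]

# Flag each sale as belonging to a top suburb or to the long tail.
toy$SuburbGroup <- ifelse(toy$Suburb %in% top_n, "Top suburbs", "Other suburbs")

# Median price per group (the median is less sensitive to extreme prices than the mean).
group_summary <- aggregate(Price ~ SuburbGroup, data = toy, FUN = median)
print(group_summary)
```

On the real data the same grouping variable could also be used as a fill or facet in the price plots.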
region_frequencies <- as.data.frame(table(data_cleaned$Regionname))
region_frequencies
## Var1 Freq
## 1 Eastern Metropolitan 7583
## 2 Eastern Victoria 374
## 3 Northern Metropolitan 13593
## 4 Northern Victoria 455
## 5 South-Eastern Metropolitan 4008
## 6 Southern Metropolitan 12545
## 7 Western Metropolitan 9680
## 8 Western Victoria 181
ggplot(region_frequencies, aes(x = reorder(Var1, Freq), y = Freq)) +
geom_bar(stat = "identity", fill = "skyblue", color = "white") +
labs(title = "Frequency of Regionname", x = "Region", y = "Frequency") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
method_frequencies <- as.data.frame(table(data_cleaned$Method))
method_frequencies
## Var1 Freq
## 1 PI 5938
## 2 PN 0
## 3 S 30616
## 4 SA 365
## 5 SN 0
## 6 SP 6477
## 7 SS 0
## 8 VB 5023
## 9 W 0
ggplot(method_frequencies, aes(x = reorder(Var1, Freq), y = Freq)) +
geom_bar(stat = "identity", fill = "skyblue", color = "white") +
labs(title = "Frequency of Method", x = "Method", y = "Frequency") +
theme_minimal() +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
type_frequencies <- as.data.frame(table(data_cleaned$Type))
type_frequencies
## Var1 Freq
## 1 h 34147
## 2 t 4980
## 3 u 9292
ggplot(type_frequencies, aes(x = reorder(Var1, Freq), y = Freq)) +
geom_bar(stat = "identity", fill = "skyblue", color = "white") +
labs(title = "Frequency of Type", x = "Type", y = "Frequency") +
scale_x_discrete(labels = c("h" = "House", "u" = "Unit", "t" = "Townhouse")) +
theme_minimal()
councilarea_frequencies <- as.data.frame(table(data_cleaned$CouncilArea))
councilarea_frequencies
## Var1 Freq
## 1 Banyule City Council 2900
## 2 Bayside City Council 2495
## 3 Boroondara City Council 3454
## 4 Brimbank City Council 2720
## 5 Cardinia Shire Council 52
## 6 Casey City Council 343
## 7 Darebin City Council 3462
## 8 Frankston City Council 656
## 9 Glen Eira City Council 2351
## 10 Greater Dandenong City Council 596
## 11 Hobsons Bay City Council 1112
## 12 Hume City Council 2373
## 13 Kingston City Council 2024
## 14 Knox City Council 749
## 15 Macedon Ranges Shire Council 114
## 16 Manningham City Council 1730
## 17 Maribyrnong City Council 1734
## 18 Maroondah City Council 1006
## 19 Melbourne City Council 2054
## 20 Melton City Council 551
## 21 Mitchell Shire Council 29
## 22 Monash City Council 2439
## 23 Moonee Valley City Council 2163
## 24 Moorabool Shire Council 11
## 25 Moreland City Council 2517
## 26 Murrindindi Shire Council 1
## 27 Nillumbik Shire Council 238
## 28 Port Phillip City Council 1291
## 29 Stonnington City Council 1227
## 30 Whitehorse City Council 1319
## 31 Whittlesea City Council 2059
## 32 Wyndham City Council 1154
## 33 Yarra City Council 1320
## 34 Yarra Ranges Shire Council 175
ggplot(councilarea_frequencies, aes(x = reorder(Var1, Freq), y = Freq)) +
geom_bar(stat = "identity", fill = "skyblue", color = "white") +
labs(title = "Frequency of Council Area", x = "Council Area", y = "Frequency") +
theme_minimal() +
theme(axis.text.x = element_text(angle =90, hjust = 1))
Regarding the method of sale, there are 9 levels:
- S (Sold): The property was successfully sold at the auction.
- SP (Sold Prior): The property was sold before the auction took place.
- PI (Passed In): The property did not meet the reserve price at auction and was not sold. It may still be on the market or available for negotiation.
- PN (Sold Prior Not Disclosed): The property was sold before the auction, but the sale price was not disclosed.
- SN (Sold Not Disclosed): The property was sold at the auction, but the sale price was not disclosed.
- VB (Vendor Bid): A bid was placed by the vendor (seller) during the auction, often to help stimulate bidding, but this does not constitute a sale.
- W (Withdrawn Prior to Auction): The property was withdrawn from the auction before it took place, possibly because it was sold prior or the seller decided not to proceed.
- SA (Sold After Auction): The property did not sell during the auction but was sold afterward, either through negotiations or another method.
- SS (Sold After Auction Price Not Disclosed): The property was sold after the auction, but the sale price was not disclosed.
Out of these 9 levels, four did not occur in the cleaned dataset. This could be due to the following reasons:
- PN (Sold Prior Not Disclosed), SN (Sold Not Disclosed), and SS (Sold After Auction Price Not Disclosed): These methods involve transactions where the sale price was not disclosed. During the pre-processing stage, entries with missing price data were removed, which most likely explains their absence in the cleaned dataset.
- W (Withdrawn Prior to Auction): This method indicates that the property was withdrawn before the auction, likely resulting in no associated sale price. Since our analysis is focused on transactions with available price data, these entries would not be relevant and were consequently excluded.
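Levels with a count of zero still appear in tables and plots because a factor keeps its full set of levels after rows are filtered out. A minimal sketch of how the unused levels could be removed with base R's `droplevels()` is shown below; the `method` vector is a made-up example, not the real column.

```r
# A factor declared with all 9 method levels, but only 5 of them are observed.
method <- factor(
  c("S", "SP", "PI", "VB", "SA", "S", "S"),
  levels = c("PI", "PN", "S", "SA", "SN", "SP", "SS", "VB", "W")
)

# Unused levels are still reported, with a count of 0.
print(table(method))

# droplevels() keeps only the levels that actually occur.
method_used <- droplevels(method)
print(table(method_used))
```

Whether to drop them is a judgment call: keeping the empty levels documents what was filtered out, while dropping them gives cleaner axes and avoids empty categories in later modeling.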
Regarding the type variable, the
distribution of property types in the dataset reflects a market where
houses are the predominant type of property being sold. This could
influence various aspects of the market analysis, such as price trends
and buyer demographics, and should be taken into account when
interpreting the results of any further analysis.
3.2 Bivariate Analysis
cor_matrix <- cor(data_cleaned %>% select_if(is.numeric))
cor_matrix
## Rooms Price Postcode Propertycount Distance
## Rooms 1.00000000 0.414067323 0.0933474308 -0.0578419190 0.283653854
## Price 0.41406732 1.000000000 0.0031562658 -0.0607194850 -0.253832105
## Postcode 0.09334743 0.003156266 1.0000000000 -0.0009054878 0.504401504
## Propertycount -0.05784192 -0.060719485 -0.0009054878 1.0000000000 0.007647533
## Distance 0.28365385 -0.253832105 0.5044015045 0.0076475335 1.000000000
## Log_Price 0.46680504 0.927242791 -0.0139493522 -0.0851143692 -0.260759479
## Log_Price
## Rooms 0.46680504
## Price 0.92724279
## Postcode -0.01394935
## Propertycount -0.08511437
## Distance -0.26075948
## Log_Price 1.00000000
corrplot(cor_matrix)
Rooms by Price
price_by_room <- ggplot(data_cleaned, aes(x = as.factor(Rooms), y= Log_Price)) +
geom_violin(trim = FALSE, fill = "lightblue", color = "black")+
geom_boxplot(width = 0.1, fill = "white", outlier.color = "red") +
labs (x = "Number of Rooms", title = "Houses Log(Prices) by Rooms")
price_by_room +
theme_minimal()
The relationship between the number of rooms and the log-transformed price
suggests that, as expected, houses with more rooms tend to have higher
prices. However, the variation in prices within each room category is
large, indicating that the number of rooms is not the sole determinant of
property prices. The presence of outliers and the wide range of values
reinforce the need to consider other variables when modeling price.
The correlation matrix adds two further observations. The correlation
between price and distance from the city center is negative (-0.25),
which suggests that properties located farther from the city center tend
to have lower prices. There is also a positive correlation between the
number of rooms and distance from the city center (0.28). This might
indicate that larger properties with more rooms are more commonly found
in suburban or rural areas than in the city center, where space is
limited and properties tend to be smaller.
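Because Price is strongly skewed, Pearson correlations on the raw scale can be influenced by a few very expensive properties. One possible robustness check is to compare Pearson correlations on the raw and log scales with a rank-based (Spearman) correlation, which does not change under the log transformation. The sketch below uses simulated data; the variable names and the data-generating formula are assumptions for illustration only.

```r
set.seed(7)

# Simulated data: price grows multiplicatively with the number of rooms.
rooms <- sample(1:6, 500, replace = TRUE)
price <- exp(12.5 + 0.25 * rooms + rnorm(500, sd = 0.4))

# Pearson correlation on the raw and on the log scale.
pearson_raw <- cor(rooms, price)
pearson_log <- cor(rooms, log(price))

# Spearman correlation depends only on ranks, so the log makes no difference.
spearman_raw <- cor(rooms, price, method = "spearman")
spearman_log <- cor(rooms, log(price), method = "spearman")

print(c(pearson_raw = pearson_raw, pearson_log = pearson_log,
        spearman_raw = spearman_raw, spearman_log = spearman_log))
```

If the Pearson and Spearman values disagree substantially on the real data, that would suggest the linear correlation is being driven by extreme values.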
price_by_type <- ggplot(data_cleaned, aes(Type, Log_Price)) +
geom_violin(trim = FALSE, fill = "lightblue", color = "black") +
geom_boxplot(width = 0.1, fill = "white", outlier.color = "red") +
labs(title = "Violin Plot of Price by Type", x = "Type of Property", y = "Log(Price)") +
scale_x_discrete(labels = c("h" = "House", "u" = "Unit", "t" = "Townhouse")) +
theme_minimal()
price_by_type
The price distribution among different property types (houses, units,
and townhouses) shows clear differences. Houses tend to have higher
prices compared to units, while townhouses are in an intermediate
position.
#price_by_method
price_by_method <- ggplot(data_cleaned, aes(Method, Log_Price)) +
geom_violin(trim = FALSE, fill = "lightblue", color = "black") +
geom_boxplot(width = 0.1, fill = "white", outlier.color = "red") +
labs(title = "Violin Plot of Price by Method", x = "Sell Method", y = "Log(Price)") +
scale_x_discrete(labels = c("PI" = "Passed In", "S" = "Sold", "SA" = "Sold After Auction", "SP" = "Sold Prior", "VB" = "Vendor Bid")) +
theme_minimal()
price_by_method
#price_by_Region
price_by_region <- ggplot(data_cleaned, aes(Regionname, Log_Price)) +
geom_violin(trim = FALSE, fill = "lightblue", color = "black") +
geom_boxplot(width = 0.1, fill = "white", outlier.color = "red") +
labs(title = "Violin Plot of Price by Region", x = "Region Name", y = "Log(Price)") +
theme_minimal()+
theme(axis.text.x = element_text(angle =45, hjust = 1))
price_by_region
#Seller- Top sellers
sales_by_seller <- data_cleaned %>%
group_by(SellerG) %>%
summarise(num_sales = n(),
total_sales = sum(Price)
) %>%
arrange(desc(num_sales))
summary(sales_by_seller)
## SellerG num_sales total_sales
## Length:422 Min. : 1.00 Min. :3.250e+05
## Class :character 1st Qu.: 1.00 1st Qu.:1.308e+06
## Mode :character Median : 5.00 Median :4.446e+06
## Mean : 114.74 Mean :1.144e+08
## 3rd Qu.: 30.75 3rd Qu.:2.490e+07
## Max. :4818.00 Max. :5.298e+09
top_sellers <- slice_head(sales_by_seller, n=20)
top_sellers
## # A tibble: 20 × 3
## SellerG num_sales total_sales
## <chr> <int> <dbl>
## 1 Barry 4818 4022899061
## 2 Jellis 4087 5298022007
## 3 Nelson 4007 4086297166
## 4 Ray 3650 2950388659
## 5 hockingstuart 3465 3174822461
## 6 Buxton 2578 3095943274
## 7 Marshall 1720 3331132138
## 8 Fletchers 1160 1384291988
## 9 Biggin 1022 1002291929
## 10 Brad 911 743992700
## 11 Harcourts 911 783065159
## 12 YPA 897 513150700
## 13 Woodards 872 914910059
## 14 McGrath 860 851140627
## 15 Noel 835 961840513
## 16 Hodges 713 820041438
## 17 Stockdale 693 488813242
## 18 Greg 633 717565749
## 19 HAR 578 382470773
## 20 Jas 573 502753790
top_money <- sales_by_seller %>%
arrange(desc(total_sales)) %>%
slice_head(n=20)
top_money
## # A tibble: 20 × 3
## SellerG num_sales total_sales
## <chr> <int> <dbl>
## 1 Jellis 4087 5298022007
## 2 Nelson 4007 4086297166
## 3 Barry 4818 4022899061
## 4 Marshall 1720 3331132138
## 5 hockingstuart 3465 3174822461
## 6 Buxton 2578 3095943274
## 7 Ray 3650 2950388659
## 8 Fletchers 1160 1384291988
## 9 Biggin 1022 1002291929
## 10 Noel 835 961840513
## 11 Woodards 872 914910059
## 12 McGrath 860 851140627
## 13 Hodges 713 820041438
## 14 Harcourts 911 783065159
## 15 Brad 911 743992700
## 16 Greg 633 717565749
## 17 RT 480 717163566
## 18 Kay 284 589685550
## 19 Miles 529 569323888
## 20 Gary 514 530708150
percent_nsalesbyseller <- round(((sum(top_money$num_sales)/nrow(data_cleaned)) * 100), digits = 2)
percent_nsalesbyseller
## [1] 70.32
ggplot(sales_by_seller, aes(num_sales, total_sales))+
geom_point()
The market is dominated by a few key real estate agents, such as
“Barry,” “Jellis,” and “Nelson,” which together handle a large number of
transactions and generate significant sales values. The 20 sellers with
the highest total sales value account for approximately 70.32% of all
transactions, out of 422 sellers in total. This further illustrates the
concentration of market power in the hands of a few sellers.
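Market concentration can also be summarised as a cumulative share, which shows how quickly the leading sellers add up to the whole market. The following is a small base R sketch on made-up figures (`sales` and its values are hypothetical, not the real seller table):

```r
# Hypothetical sellers with their number of sales.
sales <- data.frame(
  SellerG   = c("A", "B", "C", "D", "E"),
  num_sales = c(50, 30, 10, 6, 4)
)

# Sort from largest to smallest and accumulate the share of all sales.
sales <- sales[order(-sales$num_sales), ]
sales$cum_share <- round(100 * cumsum(sales$num_sales) / sum(sales$num_sales), 2)
print(sales)
```

In this toy table the two largest sellers already cover 80% of sales. Plotting the same `cum_share` column for the real seller table would give a concentration curve.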
#working with the date variable
data_cleaned$Date <- as.Date(data_cleaned$Date, format = "%d/%m/%Y")
data_cleaned <- data_cleaned %>%
mutate(Year = format(Date, "%Y"),
Month = month(Date, label = TRUE, abbr = TRUE),
day = format(Date, "%d"),
DayOfWeekAbbrev = wday(Date, label = TRUE))
sales <- data_cleaned %>%
group_by(Date) %>%
summarise(sales_n = n(),
avg_price = mean(Price))
n_salesm <- ggplot(sales, aes(x = Date, y = sales_n))+
geom_line()+
theme_minimal() +
scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
price_flut <- ggplot(sales, aes(x = Date, y = avg_price))+
geom_line()+
theme_minimal() +
scale_x_date(date_breaks = "1 month", date_labels = "%b %Y") +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
n_salesm
price_flut
s_year <- data_cleaned %>%
group_by(Year) %>%
summarise(sales_n = n(),
avg_price = (mean(Price)/1000000))
s_year
## # A tibble: 3 × 3
## Year sales_n avg_price
## <chr> <int> <dbl>
## 1 2016 13081 0.966
## 2 2017 20270 1.02
## 3 2018 15068 0.996
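Both the number of transactions and the average price increased from 2016 to 2017 and then declined in 2018. Note that the data run from 28 January 2016 to 13 October 2018, so the 2018 figures cover an incomplete year and the drop in transaction count is partly an artifact of that.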
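Yearly totals are only comparable when each year covers the same months. One way to get a like-for-like comparison is to restrict every year to a window of months that all years share. The sketch below shows the mechanics on a made-up data frame (`toy` and its values are hypothetical); the February to September window is an assumption chosen for the example.

```r
# Hypothetical sales spread over three years.
toy <- data.frame(
  Date  = as.Date(c("2016-02-06", "2016-11-12", "2017-03-04",
                    "2017-12-09", "2018-05-05", "2018-09-15")),
  Price = c(900, 950, 1000, 1050, 980, 990) * 1000
)

toy$Year     <- format(toy$Date, "%Y")
toy$MonthNum <- as.integer(format(toy$Date, "%m"))

# Keep only months that every year covers in full (assumed window: Feb to Sep).
common <- toy[toy$MonthNum >= 2 & toy$MonthNum <= 9, ]

# Sales count and average price per year within the common window.
n_by_year   <- table(common$Year)
avg_by_year <- tapply(common$Price, common$Year, mean)
print(n_by_year)
print(avg_by_year)
```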
s_month <- data_cleaned %>%
group_by(Month) %>%
summarise(sales_n = n(),
avg_price = (mean(Price)/1000000))
by_month <- ggplot(s_month, aes(x = Month, y = sales_n)) +
geom_bar(stat = "identity", fill = "skyblue", color = "white") +
theme_minimal() +
labs(y = "Number of Sales", title = "Number of Sales by Month")
by_month_pflut <- ggplot(s_month, aes(x = Month, y = avg_price)) +
geom_bar(stat = "identity", fill = "skyblue", color = "white") +
theme_minimal() +
scale_y_continuous(labels = label_comma(suffix = "M")) +
labs(y = "Average Price (in millions)", title = "Price by Month")
s_month
## # A tibble: 12 × 3
## Month sales_n avg_price
## <ord> <int> <dbl>
## 1 Jan 1281 0.947
## 2 Feb 2552 0.977
## 3 Mar 3835 1.06
## 4 Apr 4415 0.942
## 5 May 5716 1.02
## 6 Jun 4679 1.00
## 7 Jul 3771 0.873
## 8 Aug 4152 1.02
## 9 Sep 5515 1.03
## 10 Oct 4243 1.01
## 11 Nov 4351 1.02
## 12 Dec 3909 0.990
by_month
by_month_pflut
s_week <-data_cleaned %>%
group_by(DayOfWeekAbbrev) %>%
summarise(sales_n = n(),
avg_price = (mean(Price)/1000000))
by_day <- ggplot(s_week, aes(x = DayOfWeekAbbrev, y = sales_n)) +
geom_bar(stat = "identity", fill = "skyblue", color = "white") +
theme_minimal() +
labs(y = "Number of Sales", title = "Number of Sales by Week Day")
by_day_pflut <- ggplot(s_week, aes(x = DayOfWeekAbbrev, y = avg_price)) +
geom_bar(stat = "identity", fill = "skyblue", color = "white") +
theme_minimal() +
scale_y_continuous(labels = label_comma(suffix = "M")) +
labs(y = "Average Price (in millions)", title = "Price by Week Day")
s_week
## # A tibble: 5 × 3
## DayOfWeekAbbrev sales_n avg_price
## <ord> <int> <dbl>
## 1 Sun 2201 1.01
## 2 Mon 1894 0.929
## 3 Tue 254 0.896
## 4 Thu 99 0.747
## 5 Sat 43971 1.00
by_day
by_day_pflut
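Most property transactions occur on weekends, almost all of them on Saturdays: 43,971 of the 48,419 sales (about 91%). Average prices on Saturdays and Sundays are nearly identical (about 1.00 and 1.01 million), so the data do not show a meaningful Sunday premium. The lower weekday averages rest on far fewer sales (only 99 on Thursdays) and should be read with caution.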
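The share of sales falling on each day can be checked from the printed table. This is a small sketch with the s_week values re-entered by hand; the share_pct column is an addition for illustration.

```r
# Values copied from the s_week table printed above (avg_price in millions).
s_week <- data.frame(
  DayOfWeekAbbrev = c("Sun", "Mon", "Tue", "Thu", "Sat"),
  sales_n         = c(2201, 1894, 254, 99, 43971),
  avg_price       = c(1.01, 0.929, 0.896, 0.747, 1.00)
)

# Share of all recorded sales that fall on each day, in percent.
s_week$share_pct <- round(100 * s_week$sales_n / sum(s_week$sales_n), 1)

# Sort from the busiest day to the quietest.
s_week[order(-s_week$sales_n), ]
```

Days with a very small share contribute averages based on few observations, which is why the weekday price differences are not reliable on their own.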
4 Regression Analysis
data_cleaned$Rooms <- as.factor(data_cleaned$Rooms)
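# Model 1: all candidate predictors. Regionname is listed twice in the formula; R drops the repeated term, so the fit is the same as with a single Regionname.
model1 <- lm(Log_Price ~ Rooms + Distance + Regionname + Type + Regionname + Propertycount + Date + CouncilArea, data = data_cleaned)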
summary(model1)
##
## Call:
## lm(formula = Log_Price ~ Rooms + Distance + Regionname + Type +
## Regionname + Propertycount + Date + CouncilArea, data = data_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.5024 -0.1668 -0.0126 0.1488 2.2551
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.037e+01 8.101e-02 128.014
## Rooms2 5.179e-01 7.027e-03 73.691
## Rooms3 7.678e-01 7.490e-03 102.518
## Rooms4 9.570e-01 7.883e-03 121.405
## Rooms5 1.113e+00 9.283e-03 119.954
## Rooms6 1.165e+00 1.718e-02 67.836
## Rooms7 1.136e+00 4.388e-02 25.881
## Rooms8 1.169e+00 5.996e-02 19.491
## Distance -2.768e-02 4.457e-04 -62.092
## RegionnameEastern Victoria -8.048e-02 2.432e-02 -3.309
## RegionnameNorthern Metropolitan -2.161e-01 8.204e-03 -26.336
## RegionnameNorthern Victoria -1.168e-01 2.764e-02 -4.227
## RegionnameSouth-Eastern Metropolitan -1.253e-01 1.098e-02 -11.415
## RegionnameSouthern Metropolitan -1.063e-01 8.675e-03 -12.250
## RegionnameWestern Metropolitan -1.279e-01 1.476e-02 -8.666
## RegionnameWestern Victoria -1.448e-01 2.907e-02 -4.981
## Typet -2.078e-01 4.073e-03 -51.008
## Typeu -4.422e-01 4.001e-03 -110.519
## Propertycount -1.086e-06 3.271e-07 -3.320
## Date 1.728e-04 4.633e-06 37.298
## CouncilAreaBayside City Council 5.613e-01 1.069e-02 52.491
## CouncilAreaBoroondara City Council 4.870e-01 1.022e-02 47.642
## CouncilAreaBrimbank City Council -2.621e-01 1.491e-02 -17.575
## CouncilAreaCardinia Shire Council 2.946e-01 4.490e-02 6.560
## CouncilAreaCasey City Council 1.383e-01 2.031e-02 6.808
## CouncilAreaDarebin City Council 1.359e-01 8.905e-03 15.261
## CouncilAreaFrankston City Council 3.922e-01 1.810e-02 21.674
## CouncilAreaGlen Eira City Council 3.322e-01 1.068e-02 31.107
## CouncilAreaGreater Dandenong City Council 9.393e-02 1.624e-02 5.785
## CouncilAreaHobsons Bay City Council 4.077e-02 1.650e-02 2.472
## CouncilAreaHume City Council -2.082e-01 9.217e-03 -22.589
## CouncilAreaKingston City Council 3.893e-01 1.290e-02 30.175
## CouncilAreaKnox City Council 3.211e-02 1.171e-02 2.743
## CouncilAreaMacedon Ranges Shire Council 6.272e-01 3.697e-02 16.964
## CouncilAreaManningham City Council 1.740e-01 8.646e-03 20.120
## CouncilAreaMaribyrnong City Council -9.622e-02 1.616e-02 -5.954
## CouncilAreaMaroondah City Council 9.084e-02 1.081e-02 8.406
## CouncilAreaMelbourne City Council 2.983e-01 9.912e-03 30.097
## CouncilAreaMelton City Council -2.991e-01 1.927e-02 -15.523
## CouncilAreaMitchell Shire Council 1.489e-01 5.549e-02 2.683
## CouncilAreaMonash City Council 2.709e-01 8.649e-03 31.318
## CouncilAreaMoonee Valley City Council 1.119e-01 1.568e-02 7.140
## CouncilAreaMoorabool Shire Council 9.516e-02 8.300e-02 1.147
## CouncilAreaMoreland City Council 8.388e-02 8.947e-03 9.375
## CouncilAreaMurrindindi Shire Council 7.262e-01 2.607e-01 2.786
## CouncilAreaNillumbik Shire Council -3.345e-02 2.893e-02 -1.156
## CouncilAreaPort Phillip City Council 3.296e-01 1.215e-02 27.130
## CouncilAreaStonnington City Council 4.831e-01 1.206e-02 40.048
## CouncilAreaWhitehorse City Council 1.613e-01 9.315e-03 17.320
## CouncilAreaWhittlesea City Council -1.184e-01 9.396e-03 -12.598
## CouncilAreaWyndham City Council -3.759e-01 1.609e-02 -23.363
## CouncilAreaYarra City Council 3.341e-01 1.103e-02 30.274
## CouncilAreaYarra Ranges Shire Council 1.362e-01 3.062e-02 4.449
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Rooms2 < 2e-16 ***
## Rooms3 < 2e-16 ***
## Rooms4 < 2e-16 ***
## Rooms5 < 2e-16 ***
## Rooms6 < 2e-16 ***
## Rooms7 < 2e-16 ***
## Rooms8 < 2e-16 ***
## Distance < 2e-16 ***
## RegionnameEastern Victoria 0.000938 ***
## RegionnameNorthern Metropolitan < 2e-16 ***
## RegionnameNorthern Victoria 2.37e-05 ***
## RegionnameSouth-Eastern Metropolitan < 2e-16 ***
## RegionnameSouthern Metropolitan < 2e-16 ***
## RegionnameWestern Metropolitan < 2e-16 ***
## RegionnameWestern Victoria 6.34e-07 ***
## Typet < 2e-16 ***
## Typeu < 2e-16 ***
## Propertycount 0.000901 ***
## Date < 2e-16 ***
## CouncilAreaBayside City Council < 2e-16 ***
## CouncilAreaBoroondara City Council < 2e-16 ***
## CouncilAreaBrimbank City Council < 2e-16 ***
## CouncilAreaCardinia Shire Council 5.43e-11 ***
## CouncilAreaCasey City Council 9.98e-12 ***
## CouncilAreaDarebin City Council < 2e-16 ***
## CouncilAreaFrankston City Council < 2e-16 ***
## CouncilAreaGlen Eira City Council < 2e-16 ***
## CouncilAreaGreater Dandenong City Council 7.28e-09 ***
## CouncilAreaHobsons Bay City Council 0.013455 *
## CouncilAreaHume City Council < 2e-16 ***
## CouncilAreaKingston City Council < 2e-16 ***
## CouncilAreaKnox City Council 0.006097 **
## CouncilAreaMacedon Ranges Shire Council < 2e-16 ***
## CouncilAreaManningham City Council < 2e-16 ***
## CouncilAreaMaribyrnong City Council 2.63e-09 ***
## CouncilAreaMaroondah City Council < 2e-16 ***
## CouncilAreaMelbourne City Council < 2e-16 ***
## CouncilAreaMelton City Council < 2e-16 ***
## CouncilAreaMitchell Shire Council 0.007295 **
## CouncilAreaMonash City Council < 2e-16 ***
## CouncilAreaMoonee Valley City Council 9.47e-13 ***
## CouncilAreaMoorabool Shire Council 0.251576
## CouncilAreaMoreland City Council < 2e-16 ***
## CouncilAreaMurrindindi Shire Council 0.005343 **
## CouncilAreaNillumbik Shire Council 0.247577
## CouncilAreaPort Phillip City Council < 2e-16 ***
## CouncilAreaStonnington City Council < 2e-16 ***
## CouncilAreaWhitehorse City Council < 2e-16 ***
## CouncilAreaWhittlesea City Council < 2e-16 ***
## CouncilAreaWyndham City Council < 2e-16 ***
## CouncilAreaYarra City Council < 2e-16 ***
## CouncilAreaYarra Ranges Shire Council 8.64e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2592 on 48366 degrees of freedom
## Multiple R-squared: 0.7281, Adjusted R-squared: 0.7279
## F-statistic: 2491 on 52 and 48366 DF, p-value: < 2.2e-16
library(modelr)
##
## Attaching package: 'modelr'
## The following object is masked from 'package:broom':
##
## bootstrap
data.frame(R2 = rsquare(model1, data = data_cleaned),
RMSE = rmse(model1, data = data_cleaned),
MAE = mae(model1, data = data_cleaned))
## R2 RMSE MAE
## 1 0.7281469 0.2590646 0.1975971
glance(model1)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.728 0.728 0.259 2491. 0 52 -3305. 6718. 7193.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
# Model 2: same as Model 1 without Date. Rooms is already a factor at this point, and the repeated Regionname term is again dropped by R.
model2 <- lm(Log_Price ~ Rooms + Distance + Regionname + Type + Regionname + Propertycount + CouncilArea, data = data_cleaned)
summary(model2)
##
## Call:
## lm(formula = Log_Price ~ Rooms + Distance + Regionname + Type +
## Regionname + Propertycount + CouncilArea, data = data_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.53795 -0.16902 -0.01025 0.15262 2.26022
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.336e+01 1.100e-02 1215.118
## Rooms2 5.201e-01 7.127e-03 72.971
## Rooms3 7.701e-01 7.596e-03 101.374
## Rooms4 9.600e-01 7.995e-03 120.076
## Rooms5 1.115e+00 9.415e-03 118.475
## Rooms6 1.169e+00 1.742e-02 67.090
## Rooms7 1.132e+00 4.451e-02 25.443
## Rooms8 1.164e+00 6.081e-02 19.142
## Distance -2.740e-02 4.520e-04 -60.619
## RegionnameEastern Victoria -7.238e-02 2.467e-02 -2.934
## RegionnameNorthern Metropolitan -2.143e-01 8.321e-03 -25.758
## RegionnameNorthern Victoria -1.056e-01 2.804e-02 -3.768
## RegionnameSouth-Eastern Metropolitan -1.263e-01 1.114e-02 -11.340
## RegionnameSouthern Metropolitan -1.059e-01 8.799e-03 -12.034
## RegionnameWestern Metropolitan -1.225e-01 1.497e-02 -8.186
## RegionnameWestern Victoria -1.340e-01 2.949e-02 -4.543
## Typet -2.039e-01 4.130e-03 -49.377
## Typeu -4.400e-01 4.058e-03 -108.427
## Propertycount -1.136e-06 3.317e-07 -3.425
## CouncilAreaBayside City Council 5.581e-01 1.085e-02 51.455
## CouncilAreaBoroondara City Council 4.888e-01 1.037e-02 47.155
## CouncilAreaBrimbank City Council -2.657e-01 1.512e-02 -17.569
## CouncilAreaCardinia Shire Council 2.830e-01 4.554e-02 6.214
## CouncilAreaCasey City Council 1.354e-01 2.060e-02 6.573
## CouncilAreaDarebin City Council 1.411e-01 9.031e-03 15.628
## CouncilAreaFrankston City Council 3.855e-01 1.835e-02 21.006
## CouncilAreaGlen Eira City Council 3.324e-01 1.083e-02 30.683
## CouncilAreaGreater Dandenong City Council 9.235e-02 1.647e-02 5.608
## CouncilAreaHobsons Bay City Council 3.640e-02 1.673e-02 2.176
## CouncilAreaHume City Council -1.999e-01 9.346e-03 -21.391
## CouncilAreaKingston City Council 3.880e-01 1.309e-02 29.651
## CouncilAreaKnox City Council 3.748e-02 1.187e-02 3.157
## CouncilAreaMacedon Ranges Shire Council 6.194e-01 3.750e-02 16.516
## CouncilAreaManningham City Council 1.703e-01 8.769e-03 19.425
## CouncilAreaMaribyrnong City Council -9.698e-02 1.639e-02 -5.917
## CouncilAreaMaroondah City Council 8.949e-02 1.096e-02 8.165
## CouncilAreaMelbourne City Council 2.991e-01 1.005e-02 29.745
## CouncilAreaMelton City Council -3.000e-01 1.954e-02 -15.352
## CouncilAreaMitchell Shire Council 1.549e-01 5.628e-02 2.752
## CouncilAreaMonash City Council 2.680e-01 8.772e-03 30.549
## CouncilAreaMoonee Valley City Council 1.095e-01 1.590e-02 6.885
## CouncilAreaMoorabool Shire Council 1.133e-01 8.418e-02 1.346
## CouncilAreaMoreland City Council 8.566e-02 9.075e-03 9.439
## CouncilAreaMurrindindi Shire Council 7.644e-01 2.644e-01 2.891
## CouncilAreaNillumbik Shire Council -2.884e-02 2.935e-02 -0.983
## CouncilAreaPort Phillip City Council 3.318e-01 1.232e-02 26.930
## CouncilAreaStonnington City Council 4.854e-01 1.223e-02 39.673
## CouncilAreaWhitehorse City Council 1.612e-01 9.448e-03 17.057
## CouncilAreaWhittlesea City Council -1.132e-01 9.529e-03 -11.879
## CouncilAreaWyndham City Council -3.644e-01 1.632e-02 -22.333
## CouncilAreaYarra City Council 3.387e-01 1.119e-02 30.261
## CouncilAreaYarra Ranges Shire Council 1.333e-01 3.106e-02 4.292
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Rooms2 < 2e-16 ***
## Rooms3 < 2e-16 ***
## Rooms4 < 2e-16 ***
## Rooms5 < 2e-16 ***
## Rooms6 < 2e-16 ***
## Rooms7 < 2e-16 ***
## Rooms8 < 2e-16 ***
## Distance < 2e-16 ***
## RegionnameEastern Victoria 0.003345 **
## RegionnameNorthern Metropolitan < 2e-16 ***
## RegionnameNorthern Victoria 0.000165 ***
## RegionnameSouth-Eastern Metropolitan < 2e-16 ***
## RegionnameSouthern Metropolitan < 2e-16 ***
## RegionnameWestern Metropolitan 2.76e-16 ***
## RegionnameWestern Victoria 5.55e-06 ***
## Typet < 2e-16 ***
## Typeu < 2e-16 ***
## Propertycount 0.000616 ***
## CouncilAreaBayside City Council < 2e-16 ***
## CouncilAreaBoroondara City Council < 2e-16 ***
## CouncilAreaBrimbank City Council < 2e-16 ***
## CouncilAreaCardinia Shire Council 5.21e-10 ***
## CouncilAreaCasey City Council 4.97e-11 ***
## CouncilAreaDarebin City Council < 2e-16 ***
## CouncilAreaFrankston City Council < 2e-16 ***
## CouncilAreaGlen Eira City Council < 2e-16 ***
## CouncilAreaGreater Dandenong City Council 2.06e-08 ***
## CouncilAreaHobsons Bay City Council 0.029586 *
## CouncilAreaHume City Council < 2e-16 ***
## CouncilAreaKingston City Council < 2e-16 ***
## CouncilAreaKnox City Council 0.001597 **
## CouncilAreaMacedon Ranges Shire Council < 2e-16 ***
## CouncilAreaManningham City Council < 2e-16 ***
## CouncilAreaMaribyrnong City Council 3.30e-09 ***
## CouncilAreaMaroondah City Council 3.28e-16 ***
## CouncilAreaMelbourne City Council < 2e-16 ***
## CouncilAreaMelton City Council < 2e-16 ***
## CouncilAreaMitchell Shire Council 0.005934 **
## CouncilAreaMonash City Council < 2e-16 ***
## CouncilAreaMoonee Valley City Council 5.86e-12 ***
## CouncilAreaMoorabool Shire Council 0.178208
## CouncilAreaMoreland City Council < 2e-16 ***
## CouncilAreaMurrindindi Shire Council 0.003844 **
## CouncilAreaNillumbik Shire Council 0.325664
## CouncilAreaPort Phillip City Council < 2e-16 ***
## CouncilAreaStonnington City Council < 2e-16 ***
## CouncilAreaWhitehorse City Council < 2e-16 ***
## CouncilAreaWhittlesea City Council < 2e-16 ***
## CouncilAreaWyndham City Council < 2e-16 ***
## CouncilAreaYarra City Council < 2e-16 ***
## CouncilAreaYarra Ranges Shire Council 1.77e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2629 on 48367 degrees of freedom
## Multiple R-squared: 0.7203, Adjusted R-squared: 0.72
## F-statistic: 2443 on 51 and 48367 DF, p-value: < 2.2e-16
data.frame(R2 = rsquare(model2, data = data_cleaned),
RMSE = rmse(model2, data = data_cleaned),
MAE = mae(model2, data = data_cleaned))
## R2 RMSE MAE
## 1 0.7203278 0.2627639 0.2006794
glance(model2)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.720 0.720 0.263 2443. 0 51 -3992. 8089. 8555.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
model3 <- lm(Log_Price ~ Rooms + Distance + Type + Regionname + Date + CouncilArea, data = data_cleaned)
summary(model3)
##
## Call:
## lm(formula = Log_Price ~ Rooms + Distance + Type + Regionname +
## Date + CouncilArea, data = data_cleaned)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.49827 -0.16693 -0.01301 0.14887 2.25624
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.037e+01 8.100e-02 127.964
## Rooms2 5.185e-01 7.025e-03 73.812
## Rooms3 7.683e-01 7.489e-03 102.596
## Rooms4 9.577e-01 7.881e-03 121.509
## Rooms5 1.114e+00 9.283e-03 119.977
## Rooms6 1.165e+00 1.718e-02 67.837
## Rooms7 1.136e+00 4.388e-02 25.887
## Rooms8 1.170e+00 5.996e-02 19.507
## Distance -2.792e-02 4.398e-04 -63.476
## Typet -2.079e-01 4.074e-03 -51.043
## Typeu -4.430e-01 3.995e-03 -110.895
## RegionnameEastern Victoria -7.959e-02 2.432e-02 -3.272
## RegionnameNorthern Metropolitan -2.154e-01 8.202e-03 -26.256
## RegionnameNorthern Victoria -1.135e-01 2.763e-02 -4.107
## RegionnameSouth-Eastern Metropolitan -1.222e-01 1.094e-02 -11.173
## RegionnameSouthern Metropolitan -1.045e-01 8.660e-03 -12.068
## RegionnameWestern Metropolitan -1.263e-01 1.475e-02 -8.562
## RegionnameWestern Victoria -1.397e-01 2.903e-02 -4.810
## Date 1.729e-04 4.633e-06 37.308
## CouncilAreaBayside City Council 5.583e-01 1.066e-02 52.392
## CouncilAreaBoroondara City Council 4.816e-01 1.009e-02 47.710
## CouncilAreaBrimbank City Council -2.633e-01 1.491e-02 -17.662
## CouncilAreaCardinia Shire Council 2.951e-01 4.491e-02 6.571
## CouncilAreaCasey City Council 1.363e-01 2.031e-02 6.714
## CouncilAreaDarebin City Council 1.252e-01 8.304e-03 15.079
## CouncilAreaFrankston City Council 3.884e-01 1.806e-02 21.505
## CouncilAreaGlen Eira City Council 3.284e-01 1.062e-02 30.926
## CouncilAreaGreater Dandenong City Council 8.932e-02 1.618e-02 5.521
## CouncilAreaHobsons Bay City Council 3.853e-02 1.648e-02 2.338
## CouncilAreaHume City Council -2.104e-01 9.194e-03 -22.886
## CouncilAreaKingston City Council 3.901e-01 1.290e-02 30.235
## CouncilAreaKnox City Council 3.145e-02 1.171e-02 2.687
## CouncilAreaMacedon Ranges Shire Council 6.349e-01 3.690e-02 17.204
## CouncilAreaManningham City Council 1.725e-01 8.637e-03 19.978
## CouncilAreaMaribyrnong City Council -9.914e-02 1.614e-02 -6.143
## CouncilAreaMaroondah City Council 9.162e-02 1.080e-02 8.480
## CouncilAreaMelbourne City Council 2.912e-01 9.674e-03 30.095
## CouncilAreaMelton City Council -2.987e-01 1.927e-02 -15.502
## CouncilAreaMitchell Shire Council 1.546e-01 5.547e-02 2.787
## CouncilAreaMonash City Council 2.661e-01 8.531e-03 31.195
## CouncilAreaMoonee Valley City Council 1.092e-01 1.566e-02 6.976
## CouncilAreaMoorabool Shire Council 9.905e-02 8.300e-02 1.193
## CouncilAreaMoreland City Council 7.947e-02 8.849e-03 8.981
## CouncilAreaMurrindindi Shire Council 7.364e-01 2.607e-01 2.824
## CouncilAreaNillumbik Shire Council -3.329e-02 2.894e-02 -1.151
## CouncilAreaPort Phillip City Council 3.234e-01 1.200e-02 26.938
## CouncilAreaStonnington City Council 4.782e-01 1.197e-02 39.937
## CouncilAreaWhitehorse City Council 1.622e-01 9.312e-03 17.423
## CouncilAreaWhittlesea City Council -1.210e-01 9.363e-03 -12.927
## CouncilAreaWyndham City Council -3.849e-01 1.586e-02 -24.266
## CouncilAreaYarra City Council 3.269e-01 1.082e-02 30.206
## CouncilAreaYarra Ranges Shire Council 1.386e-01 3.061e-02 4.528
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Rooms2 < 2e-16 ***
## Rooms3 < 2e-16 ***
## Rooms4 < 2e-16 ***
## Rooms5 < 2e-16 ***
## Rooms6 < 2e-16 ***
## Rooms7 < 2e-16 ***
## Rooms8 < 2e-16 ***
## Distance < 2e-16 ***
## Typet < 2e-16 ***
## Typeu < 2e-16 ***
## RegionnameEastern Victoria 0.00107 **
## RegionnameNorthern Metropolitan < 2e-16 ***
## RegionnameNorthern Victoria 4.01e-05 ***
## RegionnameSouth-Eastern Metropolitan < 2e-16 ***
## RegionnameSouthern Metropolitan < 2e-16 ***
## RegionnameWestern Metropolitan < 2e-16 ***
## RegionnameWestern Victoria 1.51e-06 ***
## Date < 2e-16 ***
## CouncilAreaBayside City Council < 2e-16 ***
## CouncilAreaBoroondara City Council < 2e-16 ***
## CouncilAreaBrimbank City Council < 2e-16 ***
## CouncilAreaCardinia Shire Council 5.04e-11 ***
## CouncilAreaCasey City Council 1.91e-11 ***
## CouncilAreaDarebin City Council < 2e-16 ***
## CouncilAreaFrankston City Council < 2e-16 ***
## CouncilAreaGlen Eira City Council < 2e-16 ***
## CouncilAreaGreater Dandenong City Council 3.39e-08 ***
## CouncilAreaHobsons Bay City Council 0.01941 *
## CouncilAreaHume City Council < 2e-16 ***
## CouncilAreaKingston City Council < 2e-16 ***
## CouncilAreaKnox City Council 0.00722 **
## CouncilAreaMacedon Ranges Shire Council < 2e-16 ***
## CouncilAreaManningham City Council < 2e-16 ***
## CouncilAreaMaribyrnong City Council 8.15e-10 ***
## CouncilAreaMaroondah City Council < 2e-16 ***
## CouncilAreaMelbourne City Council < 2e-16 ***
## CouncilAreaMelton City Council < 2e-16 ***
## CouncilAreaMitchell Shire Council 0.00532 **
## CouncilAreaMonash City Council < 2e-16 ***
## CouncilAreaMoonee Valley City Council 3.07e-12 ***
## CouncilAreaMoorabool Shire Council 0.23271
## CouncilAreaMoreland City Council < 2e-16 ***
## CouncilAreaMurrindindi Shire Council 0.00474 **
## CouncilAreaNillumbik Shire Council 0.24990
## CouncilAreaPort Phillip City Council < 2e-16 ***
## CouncilAreaStonnington City Council < 2e-16 ***
## CouncilAreaWhitehorse City Council < 2e-16 ***
## CouncilAreaWhittlesea City Council < 2e-16 ***
## CouncilAreaWyndham City Council < 2e-16 ***
## CouncilAreaYarra City Council < 2e-16 ***
## CouncilAreaYarra Ranges Shire Council 5.98e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2592 on 48367 degrees of freedom
## Multiple R-squared: 0.7281, Adjusted R-squared: 0.7278
## F-statistic: 2539 on 51 and 48367 DF, p-value: < 2.2e-16
data.frame(R2 = rsquare(model3, data = data_cleaned),
RMSE = rmse(model3, data = data_cleaned),
MAE = mae(model3, data = data_cleaned))
## R2 RMSE MAE
## 1 0.7280849 0.2590942 0.1976344
glance(model3)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.728 0.728 0.259 2539. 0 51 -3311. 6727. 7193.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
To analyze the factors influencing property prices in the Melbourne housing market, three regression models of the log-transformed price (Log_Price) were compared. Model 1 uses Rooms, Distance, Regionname, Type, Propertycount, Date, and CouncilArea; Model 2 drops Date; Model 3 keeps Date and drops Propertycount.
The models were compared on R-squared, RMSE (Root Mean Squared Error), and MAE (Mean Absolute Error), all measured on the log scale and on the same data used for fitting. Models 1 and 3 are practically indistinguishable (R-squared 0.7281 for both, RMSE 0.2591, BIC 7193; AIC slightly favors Model 1, 6718 versus 6727), while Model 2, without Date, fits noticeably worse (R-squared 0.7203, RMSE 0.2628, BIC 8555). Model 3 was selected because it matches the fit of Model 1 with one predictor fewer. Its R-squared of 0.728 indicates that approximately 72.8% of the variance in Log_Price is explained by the model.
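The comparison can be laid out side by side. The sketch below re-enters the metrics printed by the data.frame(...) and glance() calls above rather than refitting the models; the column names and the delta_aic column are choices made here for illustration.

```r
# Metrics copied from the output above (all on the log-price scale).
model_comparison <- data.frame(
  model  = c("model1", "model2", "model3"),
  n_coef = c(52, 51, 51),                      # df column reported by glance()
  r2     = c(0.7281469, 0.7203278, 0.7280849),
  rmse   = c(0.2590646, 0.2627639, 0.2590942),
  mae    = c(0.1975971, 0.2006794, 0.1976344),
  aic    = c(6718, 8089, 6727),
  bic    = c(7193, 8555, 7193)
)

# AIC difference relative to the best-scoring model (0 = lowest AIC).
model_comparison$delta_aic <- model_comparison$aic - min(model_comparison$aic)

# Order from lowest to highest AIC.
model_comparison[order(model_comparison$aic), ]
```

The gap between Models 1 and 3 is small next to the gap to Model 2, which is consistent with Date carrying much more information than Propertycount.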
Key Findings from the Regression Model
-Number of Rooms (Rooms): Rooms enters the model as a factor, so
each room count has its own coefficient relative to one-room properties.
All of these coefficients are positive and significant, and they rise
from 0.519 for two rooms to roughly 1.14 to 1.17 for six to eight rooms,
where the effect levels off. Because the response is the log of price, a
coefficient b corresponds to a price ratio of exp(b). For instance, the
coefficient of 0.958 for four rooms means that, all else being equal,
four-room properties sell for about exp(0.958) = 2.6 times the price of
one-room properties, that is, roughly 160% higher.
-Distance from City Center (Distance): There was a negative
and significant relationship between Distance and Log_Price. The
coefficient of -0.0279 means that each additional kilometer from the
city center is associated with a price about 2.8% lower
(exp(-0.0279) - 1 = -0.0275), holding other factors constant.
-Region and Council Area: The model includes the categorical
variables Regionname and CouncilArea, whose coefficients are measured
relative to the reference level of each factor (the level that does not
appear in the output). For example, the Bayside City Council area has a
coefficient of 0.558, indicating prices about 75% higher than in the
reference council area, all else being equal. All seven Regionname
coefficients are negative, from -0.080 (Eastern Victoria) to -0.215
(Northern Metropolitan), reflecting lower property prices relative to
the reference region. Because the two variables overlap geographically,
their coefficients are best read together rather than in isolation.
-Property Type (Type): Property type was also a significant
predictor, with houses (h) generally priced higher than townhouses (t)
and units (u). The coefficient for Typeu is -0.443, so units sell for
about 36% less than comparable houses, and the coefficient for Typet is
-0.208, so townhouses sell for about 19% less.
-Property Count (Propertycount): Propertycount is not part of
Model 3. In Models 1 and 2 its coefficient was statistically significant
but very small (about -1.1e-06), which corresponds to prices roughly
0.1% lower for every 1,000 additional properties in a suburb. Dropping
it leaves R-squared at 0.728, so the number of properties in a suburb is
not a major determinant of individual property prices.
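The conversion from log-scale coefficients to percentage effects on price follows from Log_Price = log(Price): a coefficient b multiplies the price by exp(b). The sketch below applies this to a few Model 3 estimates copied from the summary output; pct_effect is a helper name introduced here for illustration.

```r
# Selected Model 3 coefficients, copied from summary(model3) above.
log_coefs <- c(
  Rooms4   = 0.9577,
  Distance = -0.02792,
  Typet    = -0.2079,
  Typeu    = -0.4430,
  "CouncilAreaBayside City Council" = 0.5583
)

# Hypothetical helper: percentage change in price implied by a
# coefficient b in a model whose response is log(price).
pct_effect <- function(b) 100 * (exp(b) - 1)

round(pct_effect(log_coefs), 1)
```

For small coefficients such as Distance, the percentage effect is close to 100 times the coefficient itself, but for larger ones such as Rooms4 the two differ substantially, which is why the exponential conversion matters.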
plot(model3)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
vif(model3)
## GVIF Df GVIF^(1/(2*Df))
## Rooms 1.968611e+00 7 1.049570
## Distance 7.944009e+00 1 2.818512
## Type 1.817605e+00 2 1.161114
## Regionname 6.345246e+04 7 2.203089
## Date 1.012013e+00 1 1.005989
## CouncilArea 2.615808e+05 33 1.208050
ks_test <- ks.test(model3$residuals, "pnorm", mean = mean(model3$residuals), sd = sd(model3$residuals))
## Warning in ks.test.default(model3$residuals, "pnorm", mean =
## mean(model3$residuals), : ties should not be present for the Kolmogorov-Smirnov
## test
ks_test
##
## Asymptotic one-sample Kolmogorov-Smirnov test
##
## data: model3$residuals
## D = 0.036032, p-value < 2.2e-16
## alternative hypothesis: two-sided
plot(model3, 4)
plot(model3, 5)
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
## Warning in sqrt(crit * p * (1 - hh)/hh): NaNs produced
Assumption Checking
-Linearity: The relationships between the predictors and the
response variable were examined and found to be sufficiently linear for
the purposes of this analysis.
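-Normality of Residuals: The Kolmogorov-Smirnov test rejects normality of the residuals (D = 0.036, p-value < 2.2e-16). With more than 48,000 observations, however, the test flags even small departures, and D = 0.036 is a small maximum gap between the empirical and the normal distribution functions. The test was also run with the mean and standard deviation estimated from the residuals themselves, which makes its p-value only approximate. Given the large sample size, the Central Limit Theorem supports approximate normality of the coefficient estimates, so the impact on the overall model interpretation was considered minimal.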
-Homoscedasticity: The plot of residuals versus fitted
values showed no clear pattern, suggesting that the assumption of
homoscedasticity (constant variance of residuals) was reasonably met.
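-Multicollinearity: Variance Inflation Factors (VIFs) were
calculated to assess multicollinearity among the predictors. Because
several predictors are factors, vif() reports generalized VIFs; the
comparable quantity is GVIF^(1/(2*Df)), whose square plays the role of
an ordinary VIF. The largest values are for Distance (2.82, equivalent
to a VIF of about 7.9) and Regionname (2.20, about 4.9), which reflects
the overlap between distance, region, and council area. This is moderate
multicollinearity: it stays below the common rule-of-thumb limit of 10
but exceeds the stricter limit of 5 for Distance, so the individual
location coefficients should be interpreted with some care.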
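The adjusted GVIF values can be reproduced from the first two columns of the vif(model3) output. The sketch below re-enters those columns by hand and squares the adjusted value so that it can be compared with the usual VIF rules of thumb; the thresholds 5 and 10 are common conventions, not values taken from the original analysis.

```r
# GVIF and Df columns copied from the vif(model3) output above.
gvif <- data.frame(
  term = c("Rooms", "Distance", "Type", "Regionname", "Date", "CouncilArea"),
  GVIF = c(1.968611, 7.944009, 1.817605, 6.345246e4, 1.012013, 2.615808e5),
  Df   = c(7, 1, 2, 7, 1, 33)
)

# Adjusted GVIF, as in the third column of the output.
gvif$adj <- gvif$GVIF^(1 / (2 * gvif$Df))

# Squaring the adjusted value puts it on the scale of an ordinary VIF.
gvif$vif_equiv <- gvif$adj^2

# Classify against the conventional cutoffs of 5 and 10 (an assumption).
gvif$level <- cut(gvif$vif_equiv,
                  breaks = c(0, 5, 10, Inf),
                  labels = c("below 5", "5 to 10", "above 10"))
gvif
```

The raw GVIF values for Regionname and CouncilArea look alarming only because they are spread over many degrees of freedom; the adjusted values are the ones to compare across terms.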
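# Test the effect of removing influential points (Cook's distance)
# Note: head() keeps only the six rows with the largest Cook's distance,
# so nrow(a) below is 6 and the cutoff 4 / nrow(a) is 4/6, not 4 / n.
a <- augment(model3) %>%
arrange(desc(.cooksd)) %>%
head()
influential_points <- a %>%
filter(.cooksd > 4 / nrow(a))
# Note: the row names of influential_points are positions within that
# small table, not row identifiers of data_cleaned.
data_cleaned_no_outliers <- data_cleaned %>%
filter(!(rownames(data_cleaned) %in% rownames(influential_points)))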
model3_no_outliers <- lm(Log_Price ~ Rooms + Distance + Type + Regionname + Date + CouncilArea, data = data_cleaned_no_outliers)
summary(model3_no_outliers)
##
## Call:
## lm(formula = Log_Price ~ Rooms + Distance + Type + Regionname +
## Date + CouncilArea, data = data_cleaned_no_outliers)
##
## Residuals:
## Min 1Q Median 3Q Max
## -2.49827 -0.16693 -0.01301 0.14887 2.25624
##
## Coefficients:
## Estimate Std. Error t value
## (Intercept) 1.037e+01 8.100e-02 127.964
## Rooms2 5.185e-01 7.025e-03 73.812
## Rooms3 7.683e-01 7.489e-03 102.596
## Rooms4 9.577e-01 7.881e-03 121.509
## Rooms5 1.114e+00 9.283e-03 119.977
## Rooms6 1.165e+00 1.718e-02 67.837
## Rooms7 1.136e+00 4.388e-02 25.887
## Rooms8 1.170e+00 5.996e-02 19.507
## Distance -2.792e-02 4.398e-04 -63.476
## Typet -2.079e-01 4.074e-03 -51.043
## Typeu -4.430e-01 3.995e-03 -110.895
## RegionnameEastern Victoria -7.959e-02 2.432e-02 -3.272
## RegionnameNorthern Metropolitan -2.154e-01 8.202e-03 -26.256
## RegionnameNorthern Victoria -1.135e-01 2.763e-02 -4.107
## RegionnameSouth-Eastern Metropolitan -1.222e-01 1.094e-02 -11.173
## RegionnameSouthern Metropolitan -1.045e-01 8.660e-03 -12.068
## RegionnameWestern Metropolitan -1.263e-01 1.475e-02 -8.562
## RegionnameWestern Victoria -1.397e-01 2.903e-02 -4.810
## Date 1.729e-04 4.633e-06 37.308
## CouncilAreaBayside City Council 5.583e-01 1.066e-02 52.392
## CouncilAreaBoroondara City Council 4.816e-01 1.009e-02 47.710
## CouncilAreaBrimbank City Council -2.633e-01 1.491e-02 -17.662
## CouncilAreaCardinia Shire Council 2.951e-01 4.491e-02 6.571
## CouncilAreaCasey City Council 1.363e-01 2.031e-02 6.714
## CouncilAreaDarebin City Council 1.252e-01 8.304e-03 15.079
## CouncilAreaFrankston City Council 3.884e-01 1.806e-02 21.505
## CouncilAreaGlen Eira City Council 3.284e-01 1.062e-02 30.926
## CouncilAreaGreater Dandenong City Council 8.932e-02 1.618e-02 5.521
## CouncilAreaHobsons Bay City Council 3.853e-02 1.648e-02 2.338
## CouncilAreaHume City Council -2.104e-01 9.194e-03 -22.886
## CouncilAreaKingston City Council 3.901e-01 1.290e-02 30.235
## CouncilAreaKnox City Council 3.145e-02 1.171e-02 2.687
## CouncilAreaMacedon Ranges Shire Council 6.349e-01 3.690e-02 17.204
## CouncilAreaManningham City Council 1.725e-01 8.637e-03 19.978
## CouncilAreaMaribyrnong City Council -9.914e-02 1.614e-02 -6.143
## CouncilAreaMaroondah City Council 9.162e-02 1.080e-02 8.480
## CouncilAreaMelbourne City Council 2.912e-01 9.674e-03 30.095
## CouncilAreaMelton City Council -2.987e-01 1.927e-02 -15.502
## CouncilAreaMitchell Shire Council 1.546e-01 5.547e-02 2.787
## CouncilAreaMonash City Council 2.661e-01 8.531e-03 31.195
## CouncilAreaMoonee Valley City Council 1.092e-01 1.566e-02 6.976
## CouncilAreaMoorabool Shire Council 9.905e-02 8.300e-02 1.193
## CouncilAreaMoreland City Council 7.947e-02 8.849e-03 8.981
## CouncilAreaMurrindindi Shire Council 7.364e-01 2.607e-01 2.824
## CouncilAreaNillumbik Shire Council -3.329e-02 2.894e-02 -1.151
## CouncilAreaPort Phillip City Council 3.234e-01 1.200e-02 26.938
## CouncilAreaStonnington City Council 4.782e-01 1.197e-02 39.937
## CouncilAreaWhitehorse City Council 1.622e-01 9.312e-03 17.423
## CouncilAreaWhittlesea City Council -1.210e-01 9.363e-03 -12.927
## CouncilAreaWyndham City Council -3.849e-01 1.586e-02 -24.266
## CouncilAreaYarra City Council 3.269e-01 1.082e-02 30.206
## CouncilAreaYarra Ranges Shire Council 1.386e-01 3.061e-02 4.528
## Pr(>|t|)
## (Intercept) < 2e-16 ***
## Rooms2 < 2e-16 ***
## Rooms3 < 2e-16 ***
## Rooms4 < 2e-16 ***
## Rooms5 < 2e-16 ***
## Rooms6 < 2e-16 ***
## Rooms7 < 2e-16 ***
## Rooms8 < 2e-16 ***
## Distance < 2e-16 ***
## Typet < 2e-16 ***
## Typeu < 2e-16 ***
## RegionnameEastern Victoria 0.00107 **
## RegionnameNorthern Metropolitan < 2e-16 ***
## RegionnameNorthern Victoria 4.01e-05 ***
## RegionnameSouth-Eastern Metropolitan < 2e-16 ***
## RegionnameSouthern Metropolitan < 2e-16 ***
## RegionnameWestern Metropolitan < 2e-16 ***
## RegionnameWestern Victoria 1.51e-06 ***
## Date < 2e-16 ***
## CouncilAreaBayside City Council < 2e-16 ***
## CouncilAreaBoroondara City Council < 2e-16 ***
## CouncilAreaBrimbank City Council < 2e-16 ***
## CouncilAreaCardinia Shire Council 5.04e-11 ***
## CouncilAreaCasey City Council 1.91e-11 ***
## CouncilAreaDarebin City Council < 2e-16 ***
## CouncilAreaFrankston City Council < 2e-16 ***
## CouncilAreaGlen Eira City Council < 2e-16 ***
## CouncilAreaGreater Dandenong City Council 3.39e-08 ***
## CouncilAreaHobsons Bay City Council 0.01941 *
## CouncilAreaHume City Council < 2e-16 ***
## CouncilAreaKingston City Council < 2e-16 ***
## CouncilAreaKnox City Council 0.00722 **
## CouncilAreaMacedon Ranges Shire Council < 2e-16 ***
## CouncilAreaManningham City Council < 2e-16 ***
## CouncilAreaMaribyrnong City Council 8.15e-10 ***
## CouncilAreaMaroondah City Council < 2e-16 ***
## CouncilAreaMelbourne City Council < 2e-16 ***
## CouncilAreaMelton City Council < 2e-16 ***
## CouncilAreaMitchell Shire Council 0.00532 **
## CouncilAreaMonash City Council < 2e-16 ***
## CouncilAreaMoonee Valley City Council 3.07e-12 ***
## CouncilAreaMoorabool Shire Council 0.23271
## CouncilAreaMoreland City Council < 2e-16 ***
## CouncilAreaMurrindindi Shire Council 0.00474 **
## CouncilAreaNillumbik Shire Council 0.24990
## CouncilAreaPort Phillip City Council < 2e-16 ***
## CouncilAreaStonnington City Council < 2e-16 ***
## CouncilAreaWhitehorse City Council < 2e-16 ***
## CouncilAreaWhittlesea City Council < 2e-16 ***
## CouncilAreaWyndham City Council < 2e-16 ***
## CouncilAreaYarra City Council < 2e-16 ***
## CouncilAreaYarra Ranges Shire Council 5.98e-06 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.2592 on 48367 degrees of freedom
## Multiple R-squared: 0.7281, Adjusted R-squared: 0.7278
## F-statistic: 2539 on 51 and 48367 DF, p-value: < 2.2e-16
data.frame(R2 = rsquare(model3_no_outliers, data = data_cleaned_no_outliers),
RMSE = rmse(model3_no_outliers, data = data_cleaned_no_outliers),
MAE = mae(model3_no_outliers, data = data_cleaned_no_outliers))
## R2 RMSE MAE
## 1 0.7280849 0.2590942 0.1976344
glance(model3_no_outliers)
## # A tibble: 1 × 12
## r.squared adj.r.squared sigma statistic p.value df logLik AIC BIC
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 0.728 0.728 0.259 2539. 0 51 -3311. 6727. 7193.
## # ℹ 3 more variables: deviance <dbl>, df.residual <int>, nobs <int>
ggplot(data_cleaned, aes(x = Distance, y = Log_Price))+
geom_point()+
geom_smooth(method = "lm", se = FALSE, color = "blue")+
geom_smooth(method = "lm", se = FALSE, data = data_cleaned_no_outliers, color = "red")
## `geom_smooth()` using formula = 'y ~ x'
## `geom_smooth()` using formula = 'y ~ x'
Influence of High-Leverage Points
To check the robustness of the model, the regression was refitted on a
dataset filtered by Cook's distance. Influential points can
disproportionately affect the model's estimates, potentially skewing
the results.
The refitted model is identical to Model 3: the coefficients, R-squared (0.728), RMSE, and the residual degrees of freedom (48,367) are all unchanged. Identical degrees of freedom mean that the filter did not remove any observations, most likely because the cutoff was computed from the six-row table of largest Cook's distances (4/6) rather than from the full sample (4/n).
The comparison therefore does not yet show how sensitive the model is to influential points. The estimates may well be robust, given the sample size, but confirming this requires applying the 4/n cutoff to all observations and refitting the model.
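A filter based on Cook's distance can be sketched as follows. The example uses a small simulated dataset (toy, an assumption made so that the code is self-contained) with one deliberately influential observation, and base R's cooks.distance() instead of augment(); it illustrates the 4/n rule of thumb and is not the analysis of the housing data itself.

```r
set.seed(42)

# Simulated data: a clean linear relationship ...
n <- 200
toy <- data.frame(x = rnorm(n))
toy$y <- 1 + 2 * toy$x + rnorm(n, sd = 0.5)

# ... plus one observation that is both far out in x and far off the line.
toy$x[1] <- 6
toy$y[1] <- -20

fit_all <- lm(y ~ x, data = toy)

# Cook's distance for every observation, in the row order of toy.
cd <- cooks.distance(fit_all)

# Rule-of-thumb cutoff based on the full sample size.
cutoff <- 4 / nrow(toy)
keep <- cd <= cutoff

# Drop the flagged rows by logical index and refit.
toy_trimmed <- toy[keep, ]
fit_trimmed <- lm(y ~ x, data = toy_trimmed)

# Compare sample sizes and coefficients before and after trimming.
c(n_all = nobs(fit_all), n_trimmed = nobs(fit_trimmed))
rbind(all = coef(fit_all), trimmed = coef(fit_trimmed))
```

Applied to this report, the same pattern would use model3 and data_cleaned in place of fit_all and toy, assuming lm() did not drop any rows of data_cleaned for missing values so that the distances line up with the rows. Checking that the number of observations actually decreases is a quick way to confirm that the filter did something.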
In this comprehensive analysis of the Melbourne housing market, we explored the influence of both intrinsic property characteristics and external factors on property prices. The analysis revealed several key insights:
The regression analysis provided a deeper understanding of the relationships between these variables and property prices, with the model achieving an R² of 0.728, meaning that it explains about 73% of the variance in log-transformed prices. Further analysis could explore non-linear models or incorporate additional features to improve prediction accuracy.